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Abstract 

Background: Acupuncture has been practiced in China for thousands of years as part of the Traditional Chinese 
Medicine (TCM) and has gradually accepted in western countries as an alternative or complementary treatment. 
However, the underlying mechanism of acupuncture, especially whether there exists any difference between varies 
acupoints, remains largely unknown, which hinders its widespread use. 

Results: In this study, we develop a novel Linear Programming based Feature Selection method (LPFS) to 
understand the mechanism of acupuncture effect, at molecular level, by revealing the metabolite biomarkers for 
acupuncture treatment. Specifically, we generate and investigate the high-throughput metabolic profiles of 
acupuncture treatment at several acupoints in human. To select the subsets of metabolites that best characterize 
the acupuncture effect for each meridian point, an optimization model is proposed to identify biomarkers from 
high-dimensional metabolic data from case and control samples. Importantly, we use nearest centroid as the 
prototype to simultaneously minimize the number of selected features and the leave-one-out cross validation error 
of classifier. We compared the performance of LPFS to several state-of-the-art methods, such as SVM recursive 
feature elimination (SVM-RFE) and sparse multinomial logistic regression approach (SMLR). We find that our LPFS 
method tends to reveal a small set of metabolites with small standard deviation and large shifts, which exactly 
serves our requirement for good biomarker. Biologically, several metabolite biomarkers for acupuncture treatment 
are revealed and serve as the candidates for further mechanism investigation. Also biomakers derived from five 
meridian points, Zusanli (ST36), Liangmen (ST21), Juliao (ST3), Yanglingquan (GB34), and Weizhong (BL40), are 
compared for their similarity and difference, which provide evidence for the specificity of acupoints. 

Conclusions: Our result demonstrates that metabolic profiling might be a promising method to investigate the 
molecular mechanism of acupuncture. Comparing with other existing methods, LPFS shows better performance to 
select a small set of key molecules. In addition, LPFS is a general methodology and can be applied to other high- 
dimensional data analysis, for example cancer genomics. 
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Background 

Acupuncture, an important therapeutic method in Tra- 
ditional Chinese Medicine (TCM), has been used to 
treat various diseases for thousand years in China. 
Recently it has been gradually accepted in western coun- 
tries as an alternative or complementary treatment. 
However, how the acupuncture works remains an open 
question though acupuncture exists as one of the oldest 
continuous systems of medicine dating back 4,000 years. 
Extensive studies have been conducted on the mechan- 
ism of acupuncture to explain the effects of acupuncture 
on various systems and symptoms [1-3]. Compared to 
the relatively widespread use of acupuncture, systems 
biology is a new term to describe the recent trends in 
biology research. It emphasizes the high-throughput 
measurement of biological systems and focuses on the 
complex interactions in biological systems [4,5]. We 
highly expect that systems biology, a biology-based 
inter-disciplinary study field, will provide tremendous 
opportunities for revealing acupuncture mechanism at 
the molecular level. 

In this paper, we use systems biology method to study 
the acupuncture treatment effect by identifying a subset 
of important molecules from high-throughput metabolic 
data. Specifically, we separate the acupuncture from 
moxibustion and only study the effect of acupuncture 
on normal people by investigating the difference 
between acupuncture at particular acupoint and without 
acupuncture. Towards this aim, we utilize 1 H nuclear 
magnetic resonance (*H NMR) to investigate the effects 
of acupuncture at several meridian points on plasma 
metabolites. Then metabolite profiles (vectors) are gen- 
erated from a collection of case samples(with acupunc- 
ture at meridian point) and control samples (without 
acupuncture). These high-dimensional profile data is 
very similar to SNP (sequence data), gene expression 
(transcriptome), mass spectrum (proteome), and small 
molecules (metabolome) data in different levels. Then 
the straightforward task is to identify differentially 
expressed molecules and further classify and predict the 
diagonostic category of a sample, based on its metabolite 
profile [6]. 

Generally speaking, there are two difficulties in analyz- 
ing these high-dimensional profile data. First, a large 
number of features (metabolites in our case) are avail- 
able to predict classes for a relatively small number of 
samples. The presence of a significant number of irrele- 
vant features that are unrelated to the case status makes 
such analysis prone to the curse of dimensionality. Sec- 
ond, predictive accuracy is not the only goal and further 
biological validation and mechanism understanding call 
for explanatory power other than black box predictive 
results. Thus it is especially important to know which 



molecules largely contribute towards the classification. 
Ideally we can improve the generalization performance 
of classifier by identifying only the molecules that are 
significantly contribute to the classifier. This effect is 
attributable to the overcoming of the curse of dimen- 
sionality. For example, if it is possible to identify a small 
set of metabolites that is indeed capable of providing 
complete discriminatory information, inexpensive diag- 
nostic assays for only a few metabolites might be devel- 
oped and be widely deployed in clinical settings. 
Knowledge of a small set of diagnostically relevant 
metabolites may provide important insights into the 
mechanisms responsible for acupuncture treatment 
itself. Those molecules are usually termed as biomar- 
kers. The procedure to reveal them is referred as feature 
selection, biomarker identification, or feature ranking. 

Feature selection is known to be NP-hard [7] and the 
search becomes quickly computationally intractable. 
Suppose we treat the feature selection task in a brute 
force way. Given n features, we need to select m fea- 
tures which can get the best classification accuracy (m 
«ri) regarding to a predefined cost function. Usually in 
classification or prediction problem, the cost function is 
selected as the accuracy of the prediction. The exhaus- 
tive search method goes through all the possible combi- 
nations, with the computation complexity 0(n m ). Thus, 
this method is not practical for realistic applications. 

Existing feature selection strategies can be roughly 
categorized into three types [8]. Considering the partial 
ordering properties of the subset space, we can either 
start with an empty set and successively add features, or 
start with the set of all features and successively filter 
them. The former type is referred to as forward selec- 
tion while the latter is referred to as backward elimina- 
tion. The third type is the combination of the two 
approaches. However, all the above methods relies on 
the greedy strategy. As an example of forward feature 
selection, we might first look for the single most discri- 
minative feature using any classifier. Then we could 
search the single additional feature that gives the best 
class discrimination when considered along with the 
first feature. Keeping augmenting the feature set itera- 
tively in this greedy fashion, we stop until cross-valida- 
tion error estimates are minimized. As a result, the 
global optimal solution usually cannot be guaranteed. 

In this paper, we proposed a novel linear program- 
ming (LP) model to address this important problem. 
Feature selection problem is cast into an optimization 
problem with two objectives, one is to minimize the 
number of chosen features and the other is to maximize 
the predictive accuracy based on the centroid classifica- 
tion framework. In other words, our feature selection 
method simultaneously improves classification accuracy 
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and selects features. Comparing with several state-of- 
the-art feature selection methods, our Linear Program- 
ming based Feature Selection (LPFS) method can select 
a small set of features by applying strong regularization 
while keeping high accuracy. We then apply our method 
to analyze the metabolite profile data generated for acu- 
puncture treatment. We identify important molecules 
(biomarkers) related to the acupuncture treatment for 
several meridian points. Further characterization of the 
biomarkers and the common and difference among sev- 
eral meridian points provide biological insights for acu- 
puncture mechanisms at molecular level. Preliminary 
results in this paper were presented in our conference 
paper [9]. In this extended paper, we further provide the 
detailed derivation of the method and comparison with 
existing works. In addition, we performed more analysis 
and descriptions for the biological insights of the acu- 
puncture biomarkers. 

Method 

Analytic workflow 

In this paper, the acupuncture treatment effect is inves- 
tigated in the framework of systems biology. The basic 
analytic workflow is shown in Figure 1. As the first step, 
metabolite profiles are originally generated by 1 H NMR 
from control samples and the samples with acupuncture 
treatment in meridian points. Then we develop a linear 
programming based feature selection method to com- 
pare the two groups of metabolite profiles. Finally, a 
small set of metabolites are selected as biomarkers for 
acupuncture treatment effect. 



Overview of the linear programming based feature 
selection 

To investigate the high-dimensional data for acupunc- 
ture treatment effect, we develop a novel method, LPFS, 
to select a small set of metabolites to characterize acu- 
puncture treatment effect. The schematic illustration of 
LPFS is shown in Figure 2. LPFS performs feature selec- 
tion based on the nearest centroid classifier. On one 
hand, we want to attain the best classification accuracy 
by minimizing the loss function. On the other hand, fea- 
ture selection algorithms should be robust to noise and 
outliers in the data by applying strong regularization. 
Here we use the parsimony principle (also known as 
Occam's razor) by minimizing the number of selected 
features. Then the feature selection problem is formu- 
lated as a multi-objective programming. The next step is 
to convert this multi-objective programming into a sin- 
gle-objective linear programming by applying the e 
method. After solving the linear programming model in 
an efficient way, the optimal features can be selected. 
Centroid classification prototype 

A fast and simple algorithm for classification is the cen- 
troid method [6,10]. This algorithm assumes that the 
target classes correspond to individual (single) clusters 
and uses the cluster means (or centroids) to determine 
the class of a new sample point (see Figure 2). A proto- 
type pattern for class C ; is defined as the arithmetic 
mean: 
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Figure 1 Our analytic workflow to identify biomarkers for acupuncture treatment effect. Metabolite profiles are originally generated by H 
NMR from control samples and the samples with acupuncture treatment. Then we develop a linear programming based feature selection 
method to compare the two groups of metabolite profiles. Finally a small set of metabolites are selected as biomarkers for acupuncture effect. 
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Nearest centroid classifier 




Acupunctured samples 



Object one: minimizing 
the classification error 
(loss function) 



Control samples 



Object Two: minimizing the 
number of the selected 
features 



Multi-objective optimization problem 



J L £ -method 

Linear programming model (LPFS) 

Figure 2 The schematic illustration of LPFS. Under the nearest centroid framework, LPFS requires to minimize the classification error and the 
number of selected features, which leads to a multi-objective programming. To ensure the computational efficiency, a linear programming is 
solved to identify biomarkers. 



where 5; is the i-th training sample labeled as class 
Cj. Recall that the training sample is a metabolite spec- 
tra represented as a multi-dimensional vector (denoted 
in bold). In a similar fashion, we can obtain a prototy- 
pical vector for all the other classes. During classifica- 
tion, the class label of an unknown sample s is 
determined as: 



Li[s,fi) = ||s- /x| 



(4) 



C(s) = argmindis(/L(-c ,s) 



where dis{x, y) is a distance function or: 



C(s) = argmaxsim(/xc ,s) 



(2) 



(3) 



where sim(x, y) is a similarity metric. This simple clas- 
sifier will form the basis of our LPFS method. The 
advantages is that it works with any number of features. 
And its run-time complexity is proportional to the num- 
ber of features and the complexity of the distance or 
similarity metric used. According to the experiments in 
[11], we select L\ distance metric, which is robust to 
outliers and most appropriate for the centroid classifica- 
tion algorithm. It is defined by: 



with HjHi = Z, \y{i)\, and y{i) being the i-th element of 
vector y. The value Li(s, fi) has a linear cost in the num- 
ber of features. In this study, data sets contain two 
classes and hence the number of calls to the distance 
metric is also two. Therefore, the centroid classifier, at 
run-time, is linear in the number of features. During 
training, two prototypes are computed and the cost of 
computing each prototype is O(mN), where N is the 
number of features and m is the number of training 
samples which belong to a given class. Note that m only 
varies between data sets and not during training or fea- 
ture selection processes. Thus, we can view m as a con- 
stant and the centroid classifier has O(N) cost in the 
training phase. 

Multi-objective optimization model 

Suppose we have two groups in the training dataset, the 
case group and the control group as the gold-standard 
data to classify new samples. We denote them set T and 
F respectively. Supposing \ T\ = m u \F\ = m 2 , and the 
computed centroids are fi T and fi F respectively. A simple 
classification scheme is as follows. Given a new sample 
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s, we want to decide which group it belongs to. The L 1 
discrepancy between the sample s and the groups T and 
F can be calculated as ||s - fCr\\i and \\s - fir\\i- Thus a 
simple rule is 



se T if \\s-(i T \\i < lis — mfNi 



s € F if ||s- nrh > MfIIi 



(5) 



(6) 



Let the feature number be n. We introduce the vari- 
ables for feature selection as x = {x\, x 2 , -, x„), where x t 
= 0,1. When x t = 1, it means feature i is selected in the 
biomarker set. Otherwise it is not selected. 

Suppose the test dataset is U. And it is composed by 
the case group U T and control group U F . U = U T U U F , 
and \Ut\ = h, \U F \ = l 2 . Given a case sample Si = (sn, 
sn, —, sin), It {1, 2 Zi}, if it is classified correctly, we 
should have 



E 

i=i 



mi 

S 'i ~ E f -)'/ mi 



5 « " J2fj'/ m 2 



(7) 



Where = (fo, fo, e T, k = 1, 2, m 1 andf k 

= {fki,fia, -,fkn) £ F, k = 1,2, m 2 . 

Similarly given a control sample 5/ = (sn, s !2 , S;„), Z 
e {Z x + 1, Zi + 2 Zi + Z 2 }, if it is classified correctly, we 
should have 



E 



i=l 



E-fr/™ 2 



(8) 



With the above constraints for variable x = {x lt x 2 , -, 
x n ), the objective function is to choose as few as fea- 
tures, i. e., 



mm 

X\ ,X2 1 — /X) 



in T E Xi 



(9) 



i=i 



Thus the feature selection problem is formulated as an 
integer linear programming problem in Equation (10). 



mm 



pi = (Pu,si2, -• • ,pi«) e Lfr,le {1,2,--- (^q) 

Eili k - E™ 1 ! W m ik > k - Y^lj'<i m \ x < 

pi = (pii.sc,--- ,pi„) e tWe {1,2, ••• ,/ 2 ), 
x, = 0, 1, ie {1,2,- •• ,«) 

When we consider the noise in the measured data, not 
all the test samples can be classified exactly. We intro- 
duce the tolerable error y = {y lt y 2 , ya+i 2 } for every 
sample in U T U U F . And y t > 0, i e {1,2, Zi + Z 2 }. 
When yi is not equal to zero, it means sample i is 



wrongly classified. Otherwise this sample should be cor- 
rectly classified. 

Given a case sample Si = (sn, S[ 2 , Si„), I e {1, 2 
Zi}, we should have the following constraint considering 
the tolerable error 



E 

;=i 



■u - E c i'/ m i 



X i ~ Yl < E 



1=1 



* - E^/ m2 



Xi (11) 



Similarly given a control sample S/ = (s/i, S[ 2 , Si„), I 
e {Zi + 1, Zi + 2 Zi + Z 2 }, we should have the following 
constraint considering the tolerable error 



E 

i=i 



E f j'/ mi 



Xj + y ( 



E 

i=i 



su 



Xj(12) 



Thus the objective function composes two parts, i. e., 

n 

we want to choose as few as features mm XltXli ... iXn ^x; 

i=i 

and at the same time we want to reduce the classifica- 

Vi . In general, 

,„!« 2 1=1 

there is a trade-off relationship between the classifica- 
tion error and the number of features. Hence, the fea- 
ture selection problem can be formulated as a multi- 
objective optimization problem with discrete variables x 
(xi, x 2 , x„) and continuous variables 

Y = {yi,Y2, ■ ■ ■ 'Yh+h) as shown in Equation (13). 



vector - minimize^) 
subject to 



(11)(12) with I,- e {0, € {1,2, ■•• ,n), 
Yi > 0,i€ {1,2,... ,h + h) 



(13) 



The first term of objective function in Equation (13) is 
to minimize the number of chosen features, and the sec- 
ond one is to minimize the total classification error. 
Mixed integer linear programming 

The optimal solutions of the two-objective optimization 
problem consist of a Pareto set, which can be solved by 
transforming the two objectives of (13) into a single 
objective. One typical technique is the e-method, which 
alternates a positive scalar parameter A to obtain the 
Pareto set, with the formulation in Equation (14). 

min ^ ^ x, + X ^ 1 1 * y, 

s ' £ - £"„, \f" - Z^'i W m i \ x < -n< Z!",, \p» - T!" = \ fi<t mi \ x > 

pi = (pii.sc,--. ,pin) eU T ,le {1,2,... ,h], (1^4) 

pi = (pit-io, • • • ,pin) e U F ,le {1,2,- •• ,! 2 ), 
x, = 0,l, i e {1,2,... ,n},yj > 0, j e {1,2, • • • , h + h) 
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(14) is a mixed integer programming (ILP). The objec- 
tive function in (14) is y Xj + k y.' 2 yt ■ Theoreti- 
cally, we can obtain all optimal solutions belonging to 
the Pareto set by changing the parameter A for the sin- 
gle-objective optimization problem (14). Clearly, X trans- 
forms the number of chosen features into equivalent 
classification error in (14), and controls the balance 
between them. 

By solving the proposed mixed integer linear program- 
ming model (14), we can get solutions for the feature 
selection variables x t , i e {1, 2, • • •, «}, and classification 
error variables y jt j e {1, 2, l x + I 2 }. Checking if x t is 
equal to 1, we can know if the corresponding feature 
should be selected in the classifier. Meanwhile checking 
the values of all the yj, we can estimate the classification 
accuracy. For example, suppose the number of all / such 
that yj = 0 is 7\Ti and the number of all / such that j y > 0 
is N 2 . We can simply calculate the classification accu- 
racy by Ni/^+h and N 2 ll x +l 2 . 
Leave-one-out cross validation 

The above model (14) is based on the general idea of 
cross validation, thus it depends on the choice of T and 
F. We noticed that there are different ways to do cross 
validation in feature selection [12]. One way is that the 
feature selection is done with all the samples and the 
cross-validation is only done for the classification proce- 
dure. This kind of cross validation may severely bias the 
evaluation in favor of the studied method due to "infor- 
mation leak" in the feature selection step. Another way 
is to include the feature selection procedure in the cross 
validation, i.e., to leave the test sample(s) out from the 
training set before undergoing any feature selection. 
Our feature selection model allows the freedom to 
choose the suitable cross validation procedure according 
to the practical need. If more information is preferred to 
select biomarkers due to the scarcity of samples, we can 
use all the samples to estimate the cross validation error 
in the following resubstitution validation in Equation 
(15). 

En , ir-^h+h 

Xi + X > v.- 
i=l ' 

En I ir-^ini , ir-^n I r-^m 2 

pi = [pn.sa,--- ,pin) e T,!e {1,2, - - - ,mi), (15) 

XX, \p" - t i'/ mi \ x < + y> > XX, \p« - Z!™i-fr/ m2 |*' 

pi = (pii.sn,--- ,pin) eF.le {1,2,- •• ,l 2 ), 
Xi = 0,l, ie {1,2,--- ,n},yj > 0, j e {1,2, • • • , /, + / 2 ) 

Resubstitution error rate indicates only how good are 
our biomarkers on the training data. However, this model 
has "information leak" and will underestimate the classifi- 
cation error. In our implement, we choose the model for 
leave-one-out cross validation in Equation (16). 



ei, \ p " ~ e"t' ~ ij | *< ~ n * el jc» - i^jW"* " 

Pi- (Pa,*, ••,(»,) eT.le (1,2,. ■•,M,ft- [/tutu,- •,!*,) 6 T\(plUe {1,2,- ■ ,rai)\{l) (Ig) 

El 1 |''* "E* 1 , W m ij^ + y > E" =1 | p '' 'M m 2- 

pl-(pli,so,--,ph)EF,Ie(l,2,--.,I 2 ),/».f/M,/ l i,.-.,/ ln )€F\()ii),te(l,2,.-.,m 1 )\(l] 
«i-0,l, i6(l,2,...,n),c>0, ;e(l,2,...,li*y 

We adopt leave-one-out experiment since this particu- 
lar form of cross validation is an unbiased estimator of 
the generalization performance of classifier. It makes the 
best use of the available data and involves no random 
subsampling. Every time we pick out one sample (/ x = 1 
or l 2 = 1) from the training data and try to classify it 
correctly. And by doing m l + m 2 times test we add m 1 
+ m 2 constraints. 
Linear programming approximation 

In general, mixed integer linear programming is difficult 
to solve. To ensure the computational efficiency, mixed 
ILP in Equation (16) can be relaxed to the correspond- 
ing linear programming (LP). Linear programming is the 
simplest type of mathematical programming and has 
been widely used in systems biology study [5,13,14]. 
Therefore, sophisticated algorithm can be adopted to 
efficiently solve this LP. In terms of computational com- 
plexity, the proposed approach makes the computation 
of biomarker tractable. Finally we construct the LPFS 
model in Equation (17). 

El, I"' - E™, ' '*/("■ - !)| i - n < El,, |c» - E™ ffi m '\ i 

Pi- ipli.su, •••,Pk) eT.le (1,2,.., Ii>, ft- (ft,, te, ■,(*,) eT\(piU e(l,2, ■■,mi]\(l) (^7) 

EL I** ~ E"', W" , i|*s*» > E"., |<" ~E"', '/j/("2- 

pl-(pli,so,--,Ph)eF,Ie(l,2,...,I 2 ),/,,.r/„,/ K ,...,/ fc )ei>\(p,),te(l,2,..., mj l\(ll 
xi>0, ie(l,2,-.,n},y,>0, ; e {1,2,. ,h + fe) 

After relaxing to continuous value, the value of the 
optimal solution of Xi (LPFS score) indicates the impor- 
tance of feature i in the nearest centroid classifier. It 
should be noted that we can use other distance defini- 
tions instead of L l in our model to achieve the non- 
linear classification effect. The parameter X can be 
determined by checking the output leave-one-out pre- 
dictive accuracy. We also notice that LPFS model can 
be extended to multi-classification task and n-fold cross 
validation. 

Metabonomics profiling by 'H NMR spectra 

Venous blood (3ml) was collected into a heparin sodium 
tube and the plasma was collected by centrifugation at 
lOOOx g at 4°C for 10 minutes. An aliquot of 300 ft\ 
plasma was mixed with 250 {A D a O and 50 pi TSP (3- 
trimethylsilyl- 2 H 4 -propionic acid) in D 2 0 (1 mg/ml) in 5 
mm NMR tube. The D 2 0 provided a field-frequency 
lock solvent for the NMR spectrometer and the TSP 
served as an internal reference of chemical shift. H 
NMR spectra of the plasma samples were acquired on a 
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Varian INOVA 600 MHz NMR spectrometer at 27°C by 
using Carr-Purcell-Meiboom-Gill (CPMG) spin-echo 
pulse sequence, with a total spin-spin relaxation delay 
(2nr) of 320 ms. The free induction decays (FIDs) were 
collected into 32K data points with a spectral width of 
8000 Hz and 64 scans. The FIDs were zero-filled to 
double size and multiplied by an exponential line-broad- 
ening factor of 0.5 Hz prior to Fourier transformation 
(FT). In addition, diffusion-edited experiments were also 
carried out with BPP-LED (bipolar pulse pair longitudi- 
nal eddy current delay) pulse sequence [15,16]. The gra- 
dient amplitude was set at 35.0 G cm' 1 , with a diffusion 
delay of 100 ms. A total of 128 transients and 16k data 
points were collected with a spectral width of 8000 Hz. 
A line-broadening factor of 1 Hz was applied to FIDs 
before Fourier transformation. 

All plasma 1 H NMR spectra were manually phased 
and baseline corrected using VNMR 6.1C software (Var- 
ian, Inc.). For CPMG spectra, each spectrum over the 
range of 0.4 to 4.4 was data-reduced into integrated 
regions of equal width (0.01 ppm). For BPP-LED data, 
each spectrum over the range of 0.1 to 6.0 was segmen- 
ted into regions of equal width (0.01 ppm). The regions 
containing the resonance from residual water (4.6-5.1) 
were excluded. The integral values of each spectrum 
was normalized to constant sum of all integrals in a 
spectrum in order to reduce any significant concentra- 
tion differences between samples [17,18]. Identification 
of metabolites in spectra was accomplished based on lit- 
eratures and the Chenomx NMR Suite 5.0 (Chenomx, 
Calgary, Canada). 

Results 

Metabonomics data generation 

To investigate the acupuncture treatment effects, we ori- 
ginally obtained metabonomics data of plasma metabo- 
lites in healthy males at five meridian points using 
Proton NMR. Proton NMR (also named as Hydrogen- 1 
NMR, or X H NMR) applies nuclear magnetic resonance 
in NMR spectroscopy with respect to hydrogen-1 nuclei 
within the molecules of a substance, in order to deter- 
mine the structure of the molecules [19]. 

As a result, most organic compounds are character- 
ized by chemical shift values, which are usually 
expressed in parts per million (ppm) by frequency and 
are in the range +14 to -4 ppm. Chemical shift values 
are not precise, but typically they are regarded mainly as 
orientational. The exact value of chemical shift depends 
on molecular structure and the solvent in which the 
spectrum is being recorded. These chemical shift values 
can be mapped to eight metabolic subsets (amino acids, 
carbohydrates, energy, glycans, lipids, nucleotides, sec- 
ondary metabolites/xenobiotics, vitamins, and cofactors). 
In our experiment, 400 chemical shift values are 



measured for their concentration in plasma, and mathe- 
matically every sample is represented by a vector in 400 
dimensional space. 

Fifty healthy young males were randomly allocated to 
Zusanli (ST36), Liangmen (ST21), Juliao (ST3), Yangling- 
quan (GB34), and Weizhong (BL40) groups (The loca- 
tions of the meridian points are shown in Figure 3a. 
Among the five points, Zusanli, Liangmen, and Juliao are 
on the same meridian.). Each group contains 10 persons. 
Inside each group the corresponding meridian points 
were separately acupunctured for 5 consecutive days. In 
addition, twenty healthy young males are recruited as the 
blank control groups. All the twenty people are measured 
before the start of 5 consecutive days and additionally ten 
of them are measured after 5 consecutive days. Fasting 
venous blood was taken in all the subjects. Plasma meta- 
bolites were measured by 1 H NMR to derive metabolic 
profiles (see details in Method Section). Furthermore to 
exclude possible noises, all the seventy males are strictly 
trained to make sure their metabolic profiles are mea- 
sured in very similar conditions. The detailed experimen- 
tal method can be found in [20]. In summary, we have 80 
samples grouped into Zusanli (10 samples, acupuncture 
point ST36), Liangmen (10 samples, acupuncture point 
ST21), Juliao (10 samples, acupuncture point ST3), Yan- 
glinquan(10 samples, acupuncture point GB34), Weiz- 
hong (10 samples, acupuncture point BL40), Control I 
(10 samples, normal people measured after the consecu- 
tive 5 days), and Control II (20 samples, normal people 
measured before the consecutive 5 days). 

Classification experiments design 

With the data, we design experiments to identify bio- 
markers for the acupuncture treatment of each meridian 
point. The overall design of biomarker identification 
experiments is shown in Figure 3b. We categorize eighty 
samples into 8 groups shown as the circles in Figure 3b. 
ST36, ST21, ST3, GB34, and BL40 each has 10 samples. 
The 20 samples in Control II are naturally decomposed 
into two groups with equal size, Prel (10 samples with 
follow-up measurement after 5 days) and Pre2 (10 sam- 
ples without follow-up measurement). Treating the Prel 
as the common control set, we have seven classification 
tasks (Expl to Exp7) shown as the lines in Figure 3b. 
For example, task Expl tries to identify a subset of 
metabolites to classify Prel as the control and ST36 as 
the case. In this way, Expl to Exp5 aim to identify the 
biomarkers for acupuncture treatment on ST36, ST21, 
ST3, GB34, and BL40 respectively. While Exp6 tries to 
capture the metabolite change by 5 consecutive days 
without acupuncture. And Exp7 tries to test if there are 
significant metabolite change for the people under simi- 
lar condition. Exp6 and Exp7 serve as the control stu- 
dies to guarantee the significance of our result. 
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a) 




b) 



Control I (10 samples) 



ST36(10 samples) 



ST21(10 samples) 



ST3(10 samples) 



GB34 (10 samples) 



BL40(10 samples) 



Figure 3 Overall design of the biomarker identification experiments, a) Metabolite profiles are originally generated by NMR from five 
acupuncture points (Zusanli (ST36), Liangmen (ST21), Juliao (ST3), Yanglingquan (GB34), and Weizhong (BL40)). b) The metabolite profiles are 
grouped into 7 sets and the biomarker identification problem is designed as 7 binary classification experiments. 



Pre II (10 samples) 



Global characterization of the data 

We first perform hierarchical clustering on the 80 meta- 
bolic profiles. The results are shown in Figure 4a. If the 
samples can be clearly discriminated by global pattern, the 



80 samples should be clustered by with or without acu- 
puncture treatment and then by their meridian points. 
However, all the sample labels are mixed in the clustering 
result (Figure 4a) and we cannot see clearly boundaries. 
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Figure 4 Global characterization of the metabolite profiles, a) Hierarchical clustering of the metabolic profiles of the 80 samples, b) 
Centroids for the seven datasets. The horizontal units are expression values for the metabolites. The metabolites are sorted by their chemical 
shift values. 



Furthermore, we calculate the centroids for the seven 
groups of samples in Figure 4b by averaging the 10 sam- 
ples for their metabolite expression values. These cen- 
troids are plotted side by side in Figure 4b, which shows 
that these centroids are very similar and it's very diffi- 
cult to detect the global difference. 

The above results together demonstrate that global 
pattern in metabolic profiles cannot discriminate the 
Zusanli, Yanglingquan, Liangmen, Juliao, Weizhong, 
Prel, Pre2, and Control I groups. Thus it is necessary to 
find the local pattern in the profile data. Our strategy is 
to find a subset of metabolites as biomarkers to achieve 
clear discrimination. 

Comparison with other approaches 

Before we conduct the acupuncture biomarker identifi- 
cations, we benchmark our LPFS method by comparing 
with several existing state-of-the-art methods. There are 
many existing feature selection methods and they can 
be roughly categorized into three types, filter, wrapper, 
and embedded methods. To make the comparison sim- 
ple and comprehensive, we pick out some representative 
methods in each type to compare in the same dataset. 

Filter methods select features as a preprocessing step 
and feature selection part is independent of a machine 
learning algorithm (classifier). This is computationally 
efficient. Fold change and t-test are the simplest and 
popular methods to identify biomarker. They are usually 
the representative methods for filter methods. 

Let %ij and j, y denote the log expression values of 
metabolite i in sample / in the case and control, 



respectively. We define the ordinary two-sample t-statis- 
tic [21] as 



(18) 



Where Xi,fi, and s, are the mean of case, mean of 
control, and the standard deviation of the samples for 
metabolite i. From t-statistic Tj we can easily calculate a 
p-value. Usually a feature is selected if its corresponding 
p-value is smaller than a predefined threshold 0.05. 

The standard definition of the fold-change [21] for 
metabolite i is 



FQ = =- 



(19) 



Where Xy and fu are the raw expression values of 
metabolite i in sample / in the case and control, respec- 
tively. In our implementation, we computes the differ- 
ence of means (i.e. the fold-change for log-transformed 
data) and then rank the metabolites by their absolute 
values. We choose a cutoff 2 to select the significant 
ones. 

On the other hand, wrapper method ranks features 
based on their effects on classification accuracy. It takes 
dependencies of the feature subset on the learning algo- 
rithm into account and is computationally more 
demanding. Support Vector Machine-Recursive Feature 
Elimination (SVM-RFE) is one of the most successful 
wrapper method based algorithm in the feature selection 
[22]. SVM-RFE conducts feature selection in a 
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sequential backward elimination manner, which starts 
with all the features and discards one feature at a time 
by checking the SVM accuracy. It has been widely used 
and extended in high-dimensional data analysis [23]. We 
compare our LPFS method with SVM-RFE in metabolite 
data. 

Since our LPFS method is an embedded method and 
simultaneously optimize classification accuracy and the 
number of selected features, we specifically choose to 
compare with an existing method with similar strategy, 
called sparse multinomial logistic regression approach 
(SMLR). It was developed to jointly and simultaneously 
identify the optimal nonlinear classifier, and select the 
optimal set of features via the optimization of a single 
posterior objective function (see [24] and [25]). SMLR 
has been extensively applied in problems in systems 
biology [26]. SMLR is freely available at http://www.es. 
duke.edu/~amink/software/smlr/ and we take the default 
values for the parameters in our calculation. 

Without loss of generality, we take Expl in Figure 3b as 
an example to identify a subset of metabolites to discri- 
minate ST36 and Prel. We select biomarkers from the 
metabolic profiles and compare the results in two ways. 

Firstly, we compare different methods in Figure 5, by 
assessing the quality of the selected biomarkers. We 
simply plot all the metabolites by their standard 



derivation versus the difference of mean expression 
value. This is a scatter-plot used in [27], which plots sig- 
nificance versus fold-change on the y- and x-axes, 
respectively. Conceptually, it is very similar to the vol- 
cano plot in statistics [28]. Plotting points in this way 
results in two regions of interest in the plot: those 
points that are found towards the bottom of the plot 
and far to either the left- or the right-hand side. These 
represent values that display large magnitude fold 
changes as well as high statistical significance with small 
standard derivation. 

The t-test based method identifies 172 metabolites if 
we choose a cutoff 1.73 (corresponding p-value 0.05). A 
strict threshold will still select 84 metabolites with cutoff 
2.84 (corresponding p-value 0.005). We list the top 10 in 
Table 1 and plot them in Figure 5a. We find that the 
ordinary t-statistic selects metabolites with low standard 
deviations. 

The fold change based method identifies 159 metabo- 
lites if we choose a commonly used cutoff 2 [28]. And 
there are still 97 metabolites by choosing a cutoff 4. 
Again the top 10 are listed in Table 2 and plotted in 
Figure 5b. It's clear that the fold-changes select metabo- 
lites with large shifts between control and treatment. 

While SMLR selects 37 features to achieve the 100% 
leave-one-out predictive accuracy. These 37 metabolites 




a) 




d) 



e) 



Figure 5 



Figure 5 Methods comparison via volcano like plots. Comparison of our LPFS method with existing methods regarding to the identified 
biomarkers. All the 400 metabolites are plotted into a two dimensional plane. The selected biomarkers are highlighted in red. The x-axis denotes 
the difference of means and the y-axis denotes the standard derivation. Good biomakers should locate either in the left bottom corner or in the 
right bottom corner, a) volcano like plot of t-test method. The top 10 features are in red. b) volcano like plot of fold change method. The top 10 
features are in red. c) volcano like plot of SMLR method, d) volcano like plot of SVM-RFE method, e) volcano like plot of our LPFS method. 
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Table 1 The top ten identified biomarkers by different methods on the ST36 meridian point. 





Student t-test 




Fold change 




SMLR 


SVM-RFE 






Our LPFS method 


ID 


nnm 


t-score 


ID 


rr" M 


FC-score 


ID 


ppm 


ID 


nnm 
rr IM 


ID 


ppm 


LPFS score 


Metabolite name 


86 


3.55 


15.29 


86 


3.55 


73.48 


45 


4 


86 


3.55 


86 


3.55 


0.015 




i yz) 


2.46 


11.91 


Q7 
0/ 


3.54 


68.52 


£n 
_>U 


j.yj 


68 


3.73 


m 

yz 


3.4y 


0.008 




251 


1.9 


11.07 


308 


1.33 


46.58 


52 


3.93 


87 


3.54 


87 


3.54 


0.006 


a-glucose/glycine 


45 


3.96 


10.96 


310 


1.31 


41.61 


58 


3.87 


69 


3.72 


308 


1.33 


0.002 


lactate 


229 


2.12 


10.86 


70 


3.71 


40.75 


60 


3.85 


102 


3.39 










81 


3.6 


10.80 


92 


3.49 


38.61 


67 


3.78 


70 


3.71 










18 


4.23 


10.03 


293 


1.48 


37.45 


68 


3.77 


295 


1.46 










127 


3.14 


9.75 


295 


1.46 


33.26 


69 


3.76 


116 


3.25 










17 


4.24 


9.71 


71 


3.7 


28.55 


70 


3.75 


229 


2.12 










232 


2.09 


9.35 


69 


3.72 


26.30 


71 


3.74 


88 


3.53 











are plotted in Figure 5c and 10 of them (ranked by ID) 
are listed in Table 1. Figure 5c demonstrates that SMLR 
is quite efficient to cover almost all the metabolites with 
large fold change and small standard derivation. How- 
ever, 37 metabolites is too much from the viewpoint of 
practical usage of biomarkers. 

SVM-RFE selects 15 metabolites in total to achieve the 
100% leave-one-out predictive accuracy. These metabo- 
lites are illustrated in Figure 5d and the top ten are 
listed in Table 1. Figure 5d shows that SVM-RFE favors 
the metabolites with small standard derivation. 

Our LPFS method finally selects 4 features as the bio- 
markers to discriminate ST36 and Prel. By using only 4 
features we can achieve 100% leave-one-out predictive 
accuracy. To show these four important biomarkers are 
not dependent on the nearest centroid classifier, we use 
SVM to do five-fold cross validation, the predictive accu- 
racy is still 100%. This demonstrates that we can select a 
small set of important features really matters by applying 
strong regularization. The selected 4 metabolites are 
listed in Table 1 and scatter plotted in Figure 5e. Figure 
5e also shows that our LPFS method tends to reveal the 
metabolites with small standard deviation and large 
shifts, which exactly serves our requirement for good 



biomarker. The four metabolites are with ppm 3.55, 3.54, 
3.49, and 1.33. ppm 3.54 and 1.33 are annotated as a-glu- 
cose/glycine and lactate respectively. The expression level 
of ppm 3.54, 3.49, and 1.33 goes down after acupuncture 
treatment while the expression level of ppm 3.55 goes up. 

Secondly, we compare the results of these five meth- 
ods in a venn diagram in Figure 6. Figure 6a shows the 
overlap among the results from fold change, SVM-RFE, 
SMLR, and LPFS. The 4 metabolites selected by our 
LPFS, metabolites with ppm 3.55, 3.54, 3.49, and 1.33, 
also correctly selected by fold change, SVM-RFE, and 
SMLR. Interestingly, all the results obtained by SVM- 
RFE, SMLR, and LPFS are included in the results by 
fold change. Figure 6b shows the overlap among the 
results from t-test, SVM-RFE, SMLR, and LPFS. Again, 
metabolites with ppm 3.55, 3.54, 3.49 are consistently 
supported by SVM-RFE, SMLR, and t-test. Metabolite 
with ppm 1.33 is not included in the t-test result but 
supported by SVM-RFE and SMLR. SVM-RFE and 
SMLR also select some metabolites which are not 
included by t-test results. The venn diagram demon- 
strates that fold change method is more consistent with 
current feature selection methods when the sample 
number is not so large. 



Table 2 Identified biomarkers from different meridian points by our LPFS method. 
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Figure 6 Venn diagram for the results obtained by different methods, a) Comparing t-test, SMLR, SVM-RFE, and LPFS methods by checking 
the overlaps of their selected biomarkers. b) Comparing fold change, SMLR, SVM-RFE, and LPFS methods. 



In addition to the overall venn diagram, the top ten 
biomarkers obtained by the t-test, fold change, SVM- 
RFE methods are summarized in Table 1. When com- 
paring the overlap with top 10 list, LPFS still gets con- 
sistent results with other methods. Metabolite with ppm 
3.55 has a t-test score 15.29 and fold change score 
75.38. It is ranked the first by all the three methods 
with rank information. Metabolites with ppm 3.54 and 
3.49 rank high in fold change, SVM-RFE. ppm 1.33 
ranks high in the result obtained by fold change. 

Biological insights for the identified biomarkers 

We then applied the proposed LPFS method to identify 
the biomarkers from the designed seven experiments. 
As a result, we identified 4, 7, 2, 3, and 8 biomarkers for 
the acupuncture treatment effects of ST36, ST21, ST3, 
GB34, and BL40 respectively. These selected biomarkers 
can achieve 100%,100%,100%,100%, and 95% leave-one- 
out cross validation accuracy. The results are summar- 
ized in Table 2. As expected, Exp7 fails to find any bio- 
markers. Exp6 finds several metabolites due to the fact 
that the expression values of these metabolites vary after 
consecutive 5 days. So we carefully check the obtained 
metabolites list by Exp6 and exclude these metabolites 
in our final results. Some biomarkers identified in Table 
2 are annotated as glucose and lipid. Most of them need 
further investigation on their chemical structures and 
biological functions. 

From Table 2, we can see that acupuncture at Yangm- 
ing meridian points (including acupuncture points at 
ST36, ST21, and ST3) influence mainly plasma micro- 
molecular metabolites and was closely related to energy 
metabolism pathway. Acupuncture at Yanglingquan 
influences mainly plasma macromolecular metabolites 
and is closely related to lipid metabolism and transport. 
Acupuncture at Weizhong doesn't largely influence 
plasma metabolites. This study suggests that Yangming 



meridian points have certain characteristics, which are 
different from those of both Yanglingquan and Weiz- 
hong. Metabonomics techniques based on H NMR and 
biomarker identification method provide experimental 
evidence for distinguishing between Yangming meridian 
points and other meridian points from the metabolic 
aspect. This fact may become a new useful information 
source to study the specificities of meridian points. 

To reveal the similarity and difference of the identified 
biomarkers regarding to meridian points, we calculate 
the overlaps of these biomarkers and present them in 
Figure 7. We found that Weizhong is slightly different 
from other meridian points. There is no overlapped 
metabolites for Weizhong and other meridian points. 
Compared to that, Zusanli, Yanglingquan, Juliao, and 
Liangmen are close to each other by sharing two com- 
mon metabolites: ppm 3.54 and 3.55 (Juliao is not 



ST21 GB34 




3.49 3.92 
Figure 7 Venn diagram for identified biomarkers from different 
acupuncture points. The similarity and difference of those 
identified biomarkers from Zusanli, Liangmen, Yanglingquan, and 
Weizhong are shown in a venn diagram. The overlapped biomarkers 
are indicated by their ppm and known annotation. 
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shown in Figure 7. The two biomarkers from Juliao are 
ppm 3.55 and 3.54 and totally included by Zusanli, Yan- 
glingquan, and Liangmen). In them 3.54 is annotated as 
a-glucose/glycine. Among the five points, Zusanli, Liang- 
men, and Juliao are on the same meridian. From the 
venn diagram in Figure 7, we can clearly see this trend. 
Yanglingquan has a unique biomarker with ppm 1.32. 
Our biomarker analysis indicates that acupuncture has 
some common molecules in metabolic level. At the 
same time, acupuncture at different meridian points has 
different molecular response. Our result is consistent 
with the theory on specificities of meridian points. 

Our results show that metabolite with chemical shift 
value 3.55 is clearly a common biomarker for ST36, 
ST21, ST3, and GB34. In Figure 8, we visualize the 
metabolic profiles as a two-dimensional graph and high- 
light this important molecule. The two dimensional 
graph, called the GEDI-"mosaics", provide a unique, 
one-glance visual engram that gives each high-dimen- 
sional sample a face. A characteristic of GEDI's analysis 



is that it does not prejudicate any particular structure in 
the data (such as clusters or hierarchical organization). 
Thus, it allows the researcher to use human pattern 
recognition to perform a global first-level analysis of the 
data [29] (GEDI is downloaded from http://www.chil- 
drenshospital.org/research/ingber/GEDI/gedihome.htm). 
It is clear that the highlighted metabolite has distinct 
expression value in case and control group (ST36 and 
Prel in Figure 8). This demonstrates the effectiveness of 
our biomaker identification method. 

Importantly, our LPFS method reveals the metabolite 
with ppm 1.33 as a biomarker for meridian points ST36 
and ST21. This molecule is annotated as lactate. Lactate 
has been extensively studied over years for many impor- 
tant functions. For example, the lactate has always been 
regarded as the central nervous system metabolic waste 
and a sign of hypoxia [30]. Also in recent years, evi- 
dences show that the role of lactate in brain energy 
metabolism should be re-recognized. Firstly, investiga- 
tions find that lactate is not only a very important 
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Figure 8 Highlight the selected biomaker in 2D plot. Metabolic sample is visualized as a two dimensional image. Each grid denotes a group 
of metabolites with similar profiles. Red color means the highly expressed metabolite group and blue color means the lowly expressed 
metabolite group. In particular, metabolite with chemical shift value 3.55 is highlighted in white color and indicated by the star. 
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energy resource for brain, but a sensor of brain energy 
homeostasis [31]; secondly, lactate in adult brain mainly 
comes from astrocytes, and it is a collaborative carrier 
for neuron and astrocytes [32,33]; thirdly, lactate plays 
an important role in coupling energy metabolism and 
functional activity [34]; and lastly, lactate has neuropro- 
tective effect in a number of pathological conditions. It's 
interesting to see lactate as a biomaker closely relating 
to acupuncture in our result. Whether lactate plays a 
substantial role in acupuncture effect at ST36 and ST21 
needs follow-up biological experiments. 

Discussions and conclusions 

Biomaker identification or feature selection considers 
the problem of constructing a prediction rule from only 
a feature-subset and accurately classifying the context of 
diagnosis and treatment observations (e.g. with vs. with- 
out acupuncture treatment). Such problems have 
become increasing important and quite general in geno- 
mics (identifying differentially expressed genes in micro- 
array data), proteomics (finding promising protein 
marker from the mass spectrometry data), metabolics 
(selecting metabolite markers from NMR, GC-MS data), 
and other areas of computational biology. Due to the 
number of features is much larger than the number of 
observations, simple, highly regularized approaches are 
in pressing need. Here, we proposed a novel linear pro- 
gramming based feature selection (LPFS) model to 
address this important problem. The feature selection 
problem is cast into an optimization problem with two 
objectives, one is to minimize the number of chosen fea- 
tures and the other is to maximize the predictive accu- 
racy. Mathematically the feature selection problem is 
formulated as a mixed integer linear programming pro- 
blem. Then the model is further relaxed to linear pro- 
gramming to ensure the efficient identification of a 
feature-subset. We can solve the in-essence combinator- 
ial optimization problem in a computational reasonable 
way. In summary, our LPFS method can select feature 
and learn the classifier in a joint way and we can select 
a small set of features by applying strong regularization. 
Our methodology is general and can be easily applied to 
other scenarios [35]. 

We extensively compared our LPFS method with 
existing methods in the real datasets on acupuncture 
treatment at different acupoints. We find that, 1). Our 
method can select the fewest features while achieve 
accurate predictions. 2). Our method is free of arbitrary 
threshold choice. 3). Close check of the selected feature 
shows that our method can identify those biological 
meaningful features. 4). In addition, the cross-validation 
results show that our method can achieve relatively high 
accuracy in prediction. 



Prior information allows further improvement of our 
method. Currently the identified biomarkers are inde- 
pendent to each other. We can move further step to 
interpretation by considering a group of biologically 
meaningful biomarkers. For example, we can incorpo- 
rate the network information (interactions among fea- 
tures) into the feature selection procedure. As a result, a 
pathway or modules in the network will be finally 
selected instead of single molecule as the biomarker, so 
called network biomarker. We note that prior informa- 
tion can be easily incorporated into our optimization 
model either by adding some constraints or penalizing 
in the objective function. 

In this paper, the biomarker identification for each 
acupuncture point is treated as a single binary classifica- 
tion task. We then compare the revealed biomarkers for 
their similarity and difference across different acupunc- 
ture points. We note that a multi-classifier can be devel- 
oped to systematically integrate all the profiles from 
different points together. This topic is in progress as our 
further direction. 

Finally, the metabolic profile is known for its high var- 
iance. We note that the main source of variance is from 
NMR technology instead of acupuncture effect [36]. To 
maximally reduce the variance from metabolic profile, 
we carefully design our experiments. Firstly, we used the 
relatively stable blood samples instead of urine sample. 
Secondly, we specifically use the Pareto scaling and 
orthogonal signal correction (OSC) method [37-39] to 
normalize the raw data, which will reduce the variance 
of samples inside each group and enhance the differ- 
ences among groups. Even with all these efforts, the 
remaining high variance may due to the change of 
environments and conditions and will eventually prevent 
the high accuracy for identification of biological mean- 
ingful biomarkers. In addition to variance, the limited 
number of sample in our study may also bring some 
potential effects on the results. From this viewpoint, 
these identified biomarkers should be carefully validated 
for their biological functions. Also, additional control 
study should be carefully designed to exclude other pos- 
sible cofactors. Further integration of the data from 
other levels, such as gene expression and proteomics 
levels, will improve the robustness of the identified 
biomarker. 
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